External dataset analysis - PXD052728
1. Dataset and goal
We work with:
- PRIDE ID: PXD052728
- Study: Human 2D and 3D in vitro models of muscular dystrophies
- Disease models: Duchenne (DD) and Becker (MD) muscular dystrophies
- Controls: Healthy control (HC) samples
- Model types: 2D monolayer and 3D organoid cultures
This dataset is particularly relevant for studying dystrophin-associated protein changes, extracellular matrix remodeling, and muscle fiber degeneration markers in cellular models.
Goal for students:
- Load and clean a MaxQuant proteinGroups file.
- Build a
protein_matrixandsample_metadata. - Reproduce key QC and EDA plots from Day 2.
- Fit a limma model (Day 3) and inspect a volcano plot and a heatmap.
- Perform a simple functional enrichment (Day 4).
Exercise 1.1 What are the main differences between this dataset and the internal one used on Day 2 (organism, model, number of samples)?
2. Loading and preprocessing proteinGroups
2.1 Read proteinGroups
Students should make sure the correct file is present in the working folder.
file_path <- "proteinGroups.txt" # adapt if needed
pg_raw <- readr::read_tsv(
file_path,
col_types = cols(.default = "c"),
na = c("", "NA", "NaN")
) |>
readr::type_convert()
dim(pg_raw)## [1] 3760 157
## [1] "Protein IDs" "Majority protein IDs"
## [3] "Peptide counts (all)" "Peptide counts (razor+unique)"
## [5] "Peptide counts (unique)" "Protein names"
2.2 Remove contaminants, reverse, site only
Same cleaning rules as in the course: remove proteins identified by site only, reverse database hits, and potential contaminants.
pg_filtered <- pg_raw |>
dplyr::filter(
is.na(`Only identified by site`),
is.na(Reverse),
is.na(`Potential contaminant`)
)
dim(pg_filtered)## [1] 3598 157
Exercise 2.1 How many rows are removed by this filtering step? Hint: compare
nrow(pg_raw)andnrow(pg_filtered). What percentage of the original dataset is retained?
2.3 Intensity columns and sample IDs
# Intensity columns
intensity_cols <- grep("^Intensity ", colnames(pg_filtered), value = TRUE)
# Raw labels like "Intensity 2D,DD-1" -> "2D,DD-1"
raw_labels <- gsub("^Intensity\\s+", "", intensity_cols)
# Sample IDs like "2D,DD-1" -> "2D_DD_1"
sample_ids <- raw_labels |>
gsub(",", "_", x = _) |>
gsub("-", "_", x = _)
intensity_cols## [1] "Intensity 2D,DD-1" "Intensity 2D,DD-2" "Intensity 2D,HC-1"
## [4] "Intensity 2D,HC-2" "Intensity 2D,MD-1" "Intensity 2D,MD-2"
## [7] "Intensity 3D,DD,H-1" "Intensity 3D,DD,H-2" "Intensity 3D,DD,H-3"
## [10] "Intensity 3D,DD,L-1" "Intensity 3D,DD,L-2" "Intensity 3D,DD,L-3"
## [13] "Intensity 3D,DD-1" "Intensity 3D,DD-2" "Intensity 3D,HC-1"
## [16] "Intensity 3D,HC-2" "Intensity 3D,MD,H-1" "Intensity 3D,MD,H-2"
## [19] "Intensity 3D,MD,H-3" "Intensity 3D,MD,L-1" "Intensity 3D,MD,L-2"
## [22] "Intensity 3D,MD,L-3" "Intensity 3D,MD-1" "Intensity 3D,MD-2"
## [1] "2D_DD_1" "2D_DD_2" "2D_HC_1" "2D_HC_2" "2D_MD_1" "2D_MD_2"
## [7] "3D_DD_H_1" "3D_DD_H_2" "3D_DD_H_3" "3D_DD_L_1" "3D_DD_L_2" "3D_DD_L_3"
## [13] "3D_DD_1" "3D_DD_2" "3D_HC_1" "3D_HC_2" "3D_MD_H_1" "3D_MD_H_2"
## [19] "3D_MD_H_3" "3D_MD_L_1" "3D_MD_L_2" "3D_MD_L_3" "3D_MD_1" "3D_MD_2"
2.4 Filter proteins by quantification rate
We keep proteins quantified in at least 80% of samples to ensure robust quantification.
quant_per_protein <- rowSums(!is.na(pg_filtered[intensity_cols])) /
length(intensity_cols)
summary(quant_per_protein)## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1 1 1 1 1 1
good_protein_ids <- pg_filtered$`Majority protein IDs`[quant_per_protein >= 0.8]
pg_good <- pg_filtered |>
dplyr::filter(`Majority protein IDs` %in% good_protein_ids)
dim(pg_good)## [1] 3598 157
Exercise 2.2 Change the threshold from 80 percent to 60 percent and see how the number of proteins changes. What is the trade off between more proteins and missing values?
2.5 Build protein_matrix and annotation
We use stable protein IDs as row names and keep gene names in a separate table.
protein_matrix <- pg_good |>
dplyr::select(all_of(intensity_cols)) |>
as.matrix()
rownames(protein_matrix) <- pg_good$`Majority protein IDs`
colnames(protein_matrix) <- sample_ids
# log2 transform
protein_matrix <- log2(protein_matrix + 1)
cat("Matrix dimensions:", dim(protein_matrix), "\n")## Matrix dimensions: 3598 24
## Number of missing values: 0
protein_annotation <- pg_good |>
dplyr::transmute(
protein_id = `Majority protein IDs`,
gene_names = `Gene names`,
gene_symbol = sub(";.*", "", `Gene names`),
protein_name = `Protein names`
)
head(protein_annotation)## # A tibble: 6 × 4
## protein_id gene_names gene_symbol protein_name
## <chr> <chr> <chr> <chr>
## 1 D2Y6Y7;W8D5L5;X5BSN2;X2CNL3;V5JH69;X2C587… ATP6;atp6… ATP6 ATP synthas…
## 2 P03915;Q8W946;Q8WCX8;Q8WCX1;U3M2W8;Q8WCW9… MT-ND5;ND… MT-ND5 NADH-ubiqui…
## 3 Q8HNQ2;Q8WCY5;V9JI12;V9JCG6;X5BMX5;X5BDH6… COX1;COI;… COX1 Cytochrome …
## 4 P03905;V9N518;V9J0Z6;U3L6N6;U3L6U8;U3L8P3… MT-ND4;ND… MT-ND4 NADH-ubiqui…
## 5 A0A075X871;H9QNC0;H9QP82;H9QQ96;H9QVE0;H9… COX3;CO3;… COX3 Cytochrome …
## 6 A0A6B9R9Q2;A0A075XDK3;R9Y638;R9Y4I0;R9Y3S… ND3;NADH3… ND3 NADH-ubiqui…
2.6 Build sample_metadata
Extract metadata from sample names to create a structured annotation table.
sample_metadata <- data.frame(
sample_id = sample_ids,
raw_label = raw_labels,
stringsAsFactors = FALSE
)
# Extract dimension (2D vs 3D)
sample_metadata$dimension <- ifelse(
startsWith(sample_metadata$raw_label, "2D"),
"2D",
"3D"
)
# Extract condition (DD, HC, MD)
sample_metadata$condition <- dplyr::case_when(
grepl("DD", sample_metadata$raw_label) ~ "DD",
grepl("HC", sample_metadata$raw_label) ~ "HC",
grepl("MD", sample_metadata$raw_label) ~ "MD",
TRUE ~ NA_character_
)
# Extract level (H, L, or None)
sample_metadata$level <- dplyr::case_when(
grepl(",H-", sample_metadata$raw_label) ~ "H",
grepl(",L-", sample_metadata$raw_label) ~ "L",
TRUE ~ "None"
)
# Extract replicate number
sample_metadata$replicate <- as.numeric(
sub(".*-(\\d+)$", "\\1", sample_metadata$raw_label)
)
sample_metadata## sample_id raw_label dimension condition level replicate
## 1 2D_DD_1 2D,DD-1 2D DD None 1
## 2 2D_DD_2 2D,DD-2 2D DD None 2
## 3 2D_HC_1 2D,HC-1 2D HC None 1
## 4 2D_HC_2 2D,HC-2 2D HC None 2
## 5 2D_MD_1 2D,MD-1 2D MD None 1
## 6 2D_MD_2 2D,MD-2 2D MD None 2
## 7 3D_DD_H_1 3D,DD,H-1 3D DD H 1
## 8 3D_DD_H_2 3D,DD,H-2 3D DD H 2
## 9 3D_DD_H_3 3D,DD,H-3 3D DD H 3
## 10 3D_DD_L_1 3D,DD,L-1 3D DD L 1
## 11 3D_DD_L_2 3D,DD,L-2 3D DD L 2
## 12 3D_DD_L_3 3D,DD,L-3 3D DD L 3
## 13 3D_DD_1 3D,DD-1 3D DD None 1
## 14 3D_DD_2 3D,DD-2 3D DD None 2
## 15 3D_HC_1 3D,HC-1 3D HC None 1
## 16 3D_HC_2 3D,HC-2 3D HC None 2
## 17 3D_MD_H_1 3D,MD,H-1 3D MD H 1
## 18 3D_MD_H_2 3D,MD,H-2 3D MD H 2
## 19 3D_MD_H_3 3D,MD,H-3 3D MD H 3
## 20 3D_MD_L_1 3D,MD,L-1 3D MD L 1
## 21 3D_MD_L_2 3D,MD,L-2 3D MD L 2
## 22 3D_MD_L_3 3D,MD,L-3 3D MD L 3
## 23 3D_MD_1 3D,MD-1 3D MD None 1
## 24 3D_MD_2 3D,MD-2 3D MD None 2
Exercise 2.3 How many samples are 2D vs 3D? How many are DD, HC, MD? Are the replicates balanced across conditions?
3. Day 2 style QC and EDA
3.1 Summary statistics per sample
summary_stats <- data.frame(
sample_id = colnames(protein_matrix),
mean = apply(protein_matrix, 2, mean, na.rm = TRUE),
median = apply(protein_matrix, 2, median, na.rm = TRUE),
sd = apply(protein_matrix, 2, sd, na.rm = TRUE),
n_missing = apply(protein_matrix, 2, function(x) sum(is.na(x)))
) |>
dplyr::left_join(sample_metadata, by = "sample_id")
summary_stats## sample_id mean median sd n_missing raw_label dimension condition
## 1 2D_DD_1 15.81524 21.97472 11.79927 0 2D,DD-1 2D DD
## 2 2D_DD_2 14.86546 21.16362 11.78712 0 2D,DD-2 2D DD
## 3 2D_HC_1 15.14810 22.19629 12.39780 0 2D,HC-1 2D HC
## 4 2D_HC_2 13.68269 20.46523 11.91532 0 2D,HC-2 2D HC
## 5 2D_MD_1 14.23071 20.99648 12.27349 0 2D,MD-1 2D MD
## 6 2D_MD_2 12.60573 18.80266 12.10514 0 2D,MD-2 2D MD
## 7 3D_DD_H_1 15.99881 24.30226 13.47697 0 3D,DD,H-1 3D DD
## 8 3D_DD_H_2 11.56733 0.00000 12.95062 0 3D,DD,H-2 3D DD
## 9 3D_DD_H_3 15.86178 24.24190 13.54666 0 3D,DD,H-3 3D DD
## 10 3D_DD_L_1 11.54948 0.00000 13.13548 0 3D,DD,L-1 3D DD
## 11 3D_DD_L_2 14.05422 22.25199 13.67180 0 3D,DD,L-2 3D DD
## 12 3D_DD_L_3 13.08635 0.00000 13.64021 0 3D,DD,L-3 3D DD
## 13 3D_DD_1 13.12409 19.27052 12.14635 0 3D,DD-1 3D DD
## 14 3D_DD_2 12.82152 19.00803 11.82915 0 3D,DD-2 3D DD
## 15 3D_HC_1 15.61470 22.46542 12.30757 0 3D,HC-1 3D HC
## 16 3D_HC_2 13.06168 19.36376 11.71982 0 3D,HC-2 3D HC
## 17 3D_MD_H_1 15.12121 23.89679 13.46639 0 3D,MD,H-1 3D MD
## 18 3D_MD_H_2 16.05637 24.20699 13.25602 0 3D,MD,H-2 3D MD
## 19 3D_MD_H_3 15.93791 24.16741 13.30490 0 3D,MD,H-3 3D MD
## 20 3D_MD_L_1 16.88504 24.44692 13.12621 0 3D,MD,L-1 3D MD
## 21 3D_MD_L_2 17.04056 24.58595 13.13466 0 3D,MD,L-2 3D MD
## 22 3D_MD_L_3 16.92258 24.52121 13.11874 0 3D,MD,L-3 3D MD
## 23 3D_MD_1 12.16723 17.52937 12.12537 0 3D,MD-1 3D MD
## 24 3D_MD_2 13.04085 19.53948 11.68423 0 3D,MD-2 3D MD
## level replicate
## 1 None 1
## 2 None 2
## 3 None 1
## 4 None 2
## 5 None 1
## 6 None 2
## 7 H 1
## 8 H 2
## 9 H 3
## 10 L 1
## 11 L 2
## 12 L 3
## 13 None 1
## 14 None 2
## 15 None 1
## 16 None 2
## 17 H 1
## 18 H 2
## 19 H 3
## 20 L 1
## 21 L 2
## 22 L 3
## 23 None 1
## 24 None 2
Exercise 3.1 Compare the medians and standard deviations between samples. Do you see clear differences in overall intensity between groups? Which sample has the most missing values?
3.2 Missing data pattern
missing_df <- reshape2::melt(is.na(protein_matrix))
colnames(missing_df) <- c("protein_id", "sample_id", "missing")
ggplot(missing_df, aes(x = sample_id, y = protein_id, fill = missing)) +
geom_tile() +
scale_fill_manual(values = c("TRUE" = "red", "FALSE" = "grey90")) +
theme_minimal() +
theme(
axis.text.y = element_blank(),
axis.ticks.y = element_blank(),
axis.text.x = element_text(angle = 45, vjust = 0.5, hjust = 0.5)
) +
labs(
title = "Missing data pattern",
subtitle = paste0(
"Red = missing (",
round(mean(missing_df$missing) * 100, 1),
"% of all values)"
),
x = "Sample",
y = "Proteins"
)Exercise 3.2 Are missing values evenly distributed across samples or do some samples have more missing data? Do you see any horizontal bands (proteins missing in many samples)? What would this mean?
3.3 Boxplots and density plots
protein_df <- as.data.frame(protein_matrix)
protein_df$protein_id <- rownames(protein_df)
protein_long <- tidyr::pivot_longer(
protein_df,
cols = -protein_id,
names_to = "sample_id",
values_to = "abundance"
) |>
dplyr::left_join(sample_metadata, by = "sample_id")ggplot(protein_long, aes(x = sample_id, y = abundance, fill = condition)) +
geom_boxplot(outlier.size = 0.5) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
labs(
title = "Abundance distribution by sample",
x = "Sample",
y = "log2 intensity"
)ggplot(protein_long, aes(x = abundance, colour = sample_id)) +
geom_density() +
theme_minimal() +
labs(
title = "Density of protein abundances",
x = "log2 intensity",
y = "Density"
) +
theme(legend.position = "none")Exercise 3.3 Do any samples look clearly shifted in median or spread compared to the others? Would normalisation help to align the distributions?
3.4 Normalisation (Day 3 style)
Here we apply quantile normalisation as in Day 3 to make intensity distributions comparable across samples. From this point on we will use the normalised matrix.
protein_matrix_norm <- limma::normalizeBetweenArrays(
protein_matrix,
method = "quantile"
)
cat("Dimensions after normalization:", dim(protein_matrix_norm), "\n")## Dimensions after normalization: 3598 24
# Quick boxplot comparison (optional)
par(mfrow = c(1, 2))
boxplot(protein_matrix,
main = "Before quantile normalization",
las = 2, cex.axis = 0.6, ylab = "log2 intensity")
boxplot(protein_matrix_norm,
main = "After quantile normalization",
las = 2, cex.axis = 0.6, ylab = "normalized intensity")Exercise 3.4 Compare the boxplots before and after normalisation. How does quantile normalisation change the global intensity distributions? Are the medians now aligned?
3.5 PCA
We use the normalised matrix and remove proteins with zero variance.
var_per_protein <- apply(protein_matrix_norm, 1, var, na.rm = TRUE)
non_zero_var <- var_per_protein > 0
pca_data <- t(protein_matrix_norm[non_zero_var, ])
pca_result <- prcomp(pca_data, scale. = FALSE)
pca_df <- data.frame(
PC1 = pca_result$x[, 1],
PC2 = pca_result$x[, 2],
sample_id = rownames(pca_result$x)
) |>
dplyr::left_join(sample_metadata, by = "sample_id") |>
dplyr::mutate(
condition = factor(condition, levels = c("HC", "DD", "MD")),
dimension = factor(dimension, levels = c("2D", "3D")),
level = factor(level, levels = c("H", "L", "None"))
)
var_explained <- summary(pca_result)$importance[2, 1:2] * 100
ggplot(
pca_df,
aes(x = PC1, y = PC2, colour = condition, shape = dimension)
) +
geom_point(size = 4) +
theme_minimal() +
labs(
title = "PCA of samples",
x = paste0("PC1 (", round(var_explained[1], 1), "%)"),
y = paste0("PC2 (", round(var_explained[2], 1), "%)")
) +
scale_color_brewer(palette = "Set1")ggplot(
pca_df,
aes(x = PC1, y = PC2, colour = condition, shape = level)
) +
geom_point(size = 4) +
ggrepel::geom_text_repel(
aes(label = sample_id),
size = 3,
show.legend = FALSE,
max.overlaps = Inf
) +
facet_wrap(~ dimension) +
theme_minimal() +
labs(
title = "PCA analysis - biological conditions",
x = paste0("PC1 (", round(var_explained[1], 1), "%)"),
y = paste0("PC2 (", round(var_explained[2], 1), "%)")
) +
scale_color_brewer(palette = "Dark2")Exercise 3.5 Do samples cluster more by condition (DD, HC, MD) or by dimension (2D vs 3D)? Do you see any sample that does not group with its expected class? What does PC1 vs PC2 tell you about the dominant sources of variation?
3.6 Sample correlation heatmap
cor_matrix <- cor(protein_matrix_norm, use = "pairwise.complete.obs")
ann_cols <- sample_metadata |>
dplyr::select(sample_id, condition, dimension, level) |>
as.data.frame()
rownames(ann_cols) <- ann_cols$sample_id
ann_cols$sample_id <- NULL
ann_colors <- list(
condition = c(DD = "#E41A1C", HC = "#377EB8", MD = "#4DAF4A"),
dimension = c(`2D` = "#984EA3", `3D` = "#FF7F00"),
level = c(H = "#E41A1C", L = "#377EB8", None = "grey70")
)
pheatmap(
cor_matrix,
annotation_col = ann_cols,
annotation_row = ann_cols,
annotation_colors = ann_colors,
color = colorRampPalette(c("blue", "white", "red"))(50),
main = "Sample correlation matrix"
)Exercise 3.6 Which samples show the highest similarity? Do any samples have systematically lower correlations to all others? Do samples cluster by biological condition or technical factors?
3.7 Hierarchical clustering heatmap
Here we perform a simple imputation (row mean) for proteins that still have a few missing values.
complete_proteins <- rowSums(is.na(protein_matrix_norm)) <= ncol(protein_matrix_norm) * 0.3
filtered_matrix <- protein_matrix_norm[complete_proteins, ]
for (i in seq_len(nrow(filtered_matrix))) {
missing_idx <- is.na(filtered_matrix[i, ])
if (any(missing_idx)) {
row_mean <- mean(filtered_matrix[i, ], na.rm = TRUE)
if (is.finite(row_mean)) {
filtered_matrix[i, missing_idx] <- row_mean
}
}
}
filtered_matrix <- filtered_matrix[
apply(filtered_matrix, 1, function(x) sd(x, na.rm = TRUE) > 0),
]
annotation_col <- sample_metadata |>
dplyr::select(sample_id, condition, dimension, level) |>
as.data.frame()
rownames(annotation_col) <- annotation_col$sample_id
annotation_col$sample_id <- NULL
pheatmap(
filtered_matrix,
scale = "row",
clustering_distance_rows = "euclidean",
clustering_distance_cols = "euclidean",
annotation_col = annotation_col,
show_rownames = FALSE,
main = "Hierarchical clustering of samples",
color = colorRampPalette(c("blue", "white", "red"))(50)
)Exercise 3.7 Do samples form clear clusters by condition or by dimension in this heatmap? Are there any samples that cluster unexpectedly?
4. Day 3 style differential expression with limma
We focus on a clear contrast:
- 3D samples only (more physiologically relevant)
- DD vs HC (disease vs healthy control)
- level “None” (baseline, no additional treatment)
4.1 Subset data
analysis_samples <- sample_metadata |>
dplyr::filter(
dimension == "3D",
condition %in% c("DD", "HC"),
level == "None"
)
analysis_samples## sample_id raw_label dimension condition level replicate
## 1 3D_DD_1 3D,DD-1 3D DD None 1
## 2 3D_DD_2 3D,DD-2 3D DD None 2
## 3 3D_HC_1 3D,HC-1 3D HC None 1
## 4 3D_HC_2 3D,HC-2 3D HC None 2
## [1] 3598 4
4.2 Design matrix
group <- factor(analysis_samples$condition, levels = c("HC", "DD"))
design <- model.matrix(~ group)
colnames(design) <- c("Intercept", "DD_vs_HC")
design## Intercept DD_vs_HC
## 1 1 1
## 2 1 1
## 3 1 0
## 4 1 0
## attr(,"assign")
## [1] 0 1
## attr(,"contrasts")
## attr(,"contrasts")$group
## [1] "contr.treatment"
Exercise 4.1 Why is it important to set the reference level to HC here? What would change if we set DD as the reference?
4.3 Fit limma model
# Data are already quantile-normalised, so we fit limma directly
fit <- lmFit(expr_mat, design)
fit <- eBayes(fit)
de_results <- topTable(
fit,
coef = "DD_vs_HC",
number = Inf
)
de_results <- de_results |>
dplyr::mutate(protein_id = rownames(de_results)) |>
dplyr::left_join(protein_annotation, by = "protein_id")
head(de_results)## logFC AveExpr t P.Value adj.P.Val B
## 1 25.93082 12.96541 2320.181 1.993267e-08 2.551756e-05 6.710372
## 2 25.58434 12.79217 2257.383 2.124090e-08 2.551756e-05 6.709673
## 3 -28.76748 14.38374 -2048.241 2.660690e-08 2.551756e-05 6.706864
## 4 -25.79229 12.89615 -1923.637 3.077111e-08 2.551756e-05 6.704741
## 5 20.96644 10.48322 1809.388 3.546076e-08 2.551756e-05 6.702399
## 6 20.91313 10.45656 1614.529 4.617350e-08 2.768871e-05 6.697212
## protein_id gene_names gene_symbol
## 1 A8KAQ5;A0A024QZD5;P08621 SNRP70;SNRNP70 SNRP70
## 2 Q56A80;P35475;A8K701;D3DVN7;D3DVN8 IDUA IDUA
## 3 W8QEH3 LMNA LMNA
## 4 Q13884;A0A3B3ITC2 SNTB1 SNTB1
## 5 B4DWW8;E7EVX8;Q8WWY3;E7EU94;E7ESX0;E7EN72 PRPF31 PRPF31
## 6 B7Z3I9;A0A140VJL9;A0A024R877;Q6ZMU0;P13716 ALAD ALAD
## protein_name
## 1 U1 small nuclear ribonucleoprotein 70 kDa
## 2 Alpha-L-iduronidase
## 3 <NA>
## 4 Beta-1-syntrophin
## 5 U4/U6 small nuclear ribonucleoprotein Prp31
## 6 Delta-aminolevulinic acid dehydratase
Exercise 4.2 How many proteins have adjusted p value below 0.05? How many have both adjusted p value < 0.05 and absolute log2 fold change > log2(1.5)? What percentage of tested proteins are significant?
4.4 Volcano plot with gene names and significance
This follows the style of Day 3 section 3.4.3 in the course.
volcano_data <- de_results |>
dplyr::mutate(
significance = dplyr::case_when(
adj.P.Val < 0.05 & logFC > log2(1.5) ~ "Up",
adj.P.Val < 0.05 & logFC < -log2(1.5) ~ "Down",
TRUE ~ "NS"
),
label = dplyr::if_else(
!is.na(gene_symbol) & gene_symbol != "",
gene_symbol,
protein_id
)
)
ggplot(volcano_data, aes(x = logFC, y = -log10(adj.P.Val), colour = significance)) +
geom_point(alpha = 0.6, size = 2) +
scale_color_manual(values = c("Up" = "red", "Down" = "blue", "NS" = "grey")) +
geom_hline(yintercept = -log10(0.05), linetype = "dashed") +
geom_vline(xintercept = c(-log2(1.5), log2(1.5)), linetype = "dashed") +
theme_minimal() +
labs(
title = "Volcano plot: 3D DD vs HC",
x = "log2 fold change (DD vs HC)",
y = "-log10 adjusted p value"
) +
theme(legend.title = element_blank())Optional labels for top hits:
top_labels <- volcano_data |>
dplyr::filter(significance != "NS") |>
dplyr::arrange(adj.P.Val) |>
dplyr::slice(1:15)
ggplot(volcano_data, aes(x = logFC, y = -log10(adj.P.Val), colour = significance)) +
geom_point(alpha = 0.6, size = 2) +
scale_color_manual(values = c("Up" = "red", "Down" = "blue", "NS" = "grey")) +
geom_hline(yintercept = -log10(0.05), linetype = "dashed") +
geom_vline(xintercept = c(-log2(1.5), log2(1.5)), linetype = "dashed") +
ggrepel::geom_text_repel(
data = top_labels,
aes(label = label),
size = 3,
max.overlaps = Inf
) +
theme_minimal() +
labs(
title = "Volcano plot with top proteins labelled",
x = "log2 fold change (DD vs HC)",
y = "-log10 adjusted p value"
) +
theme(legend.title = element_blank())Exercise 4.3 Look at the proteins highlighted in the labelled volcano. Can you recognise any genes related to muscle structure or dystrophy?
4.5 Heatmap of top proteins with gene names
top_proteins <- de_results |>
dplyr::arrange(adj.P.Val) |>
dplyr::slice(1:50) |>
dplyr::pull(protein_id)
heatmap_mat <- expr_mat[top_proteins, analysis_samples$sample_id]
row_genes <- protein_annotation$gene_symbol[
match(top_proteins, protein_annotation$protein_id)
]
row_labels <- ifelse(
is.na(row_genes) | row_genes == "",
top_proteins,
row_genes
)
rownames(heatmap_mat) <- make.unique(row_labels)
ann_col <- analysis_samples |>
dplyr::select(sample_id, condition, dimension, level) |>
as.data.frame()
rownames(ann_col) <- ann_col$sample_id
ann_col$sample_id <- NULL
pheatmap(
heatmap_mat,
annotation_col = ann_col,
show_rownames = TRUE,
main = "Top 50 proteins - 3D DD vs HC",
color = colorRampPalette(c("blue", "white", "red"))(50)
)Exercise 4.4 Do DD and HC samples form separate clusters based on the top 50 proteins? Are there any samples that do not follow the general pattern?
5. Day 4 style Functional enrichment
5.1 Functional enrichment (g:Profiler)
Differential expression highlights individual proteins that change between conditions. Functional enrichment aims to understand whether groups of these proteins participate in related biological processes, pathways or phenotypes.
Here we follow a structure similar to Day 4 of the r4proteomics
course: 1. Build a set of significantly changed genes for ORA. 2. Create
a ranked gene list using the limma moderated t statistic. 3. Run
g:Profiler on the significant gene set. 4. Visualise results with both
gostplot() and a publication-style barplot. 5. Export gene
lists for external tools.
5.1.1 Prepare gene sets for ORA and ranked analysis
We select significantly changed proteins using an adjusted p value cutoff and a minimum fold change. We also build a ranked list of genes using the moderated t statistic, which is the recommended signed statistic for GSEA-style analyses.
Note: Often, when performing enrichment analysis it is good to provide a list of genes or proteins to use as background. This is a list of all genes or proteins for which we have detected signal.
# Significant proteins for ORA (over-representation analysis)
de_significant <- de_results |>
dplyr::filter(
!is.na(gene_symbol),
gene_symbol != "",
adj.P.Val < 0.05,
abs(logFC) > log2(1.5)
) |>
dplyr::arrange(adj.P.Val)
cat("Significant DE proteins for ORA:", nrow(de_significant), "\n")## Significant DE proteins for ORA: 409
## [1] 402
## [1] "SNRP70" "IDUA" "LMNA" "SNTB1" "PRPF31" "ALAD"
# Background: all quantified proteins with a gene symbol
bg_genes <- de_results |>
dplyr::filter(!is.na(gene_symbol), gene_symbol != "") |>
dplyr::pull(gene_symbol) |>
unique()
cat("Background genes (custom_bg):", length(bg_genes), "\n")## Background genes (custom_bg): 3464
# Ranked gene list for GSEA-style workflows using moderated t-statistic
ranked_tbl <- de_results |>
dplyr::filter(!is.na(gene_symbol), gene_symbol != "") |>
dplyr::select(gene_symbol, t) |>
dplyr::group_by(gene_symbol) |>
dplyr::summarise(t_stat = t[which.max(abs(t))], .groups = "drop") |>
dplyr::arrange(dplyr::desc(t_stat))
gene_list <- ranked_tbl$t_stat
names(gene_list) <- ranked_tbl$gene_symbol
cat("Ranked list genes:", length(gene_list), "\n")## Ranked list genes: 3464
## SNRP70 IDUA PRPF31 ALAD PIK3C3 WISP2
## 2320.181 2257.383 1809.388 1614.529 1366.613 1100.186
Exercise 5.1 Change the thresholds for selecting significant proteins and check how
length(sig_genes)changes. What impact do you expect on the enrichment?
5.1.2 Run g:Profiler ORA (GO, Reactome, HP)
We now perform over-representation analysis with gprofiler2. It includes multiple annotation sources including GO, KEGG, Reactome, Human Phenotype Ontology (HP), among others. We use all proteins that are quantified in the dataset (and kept after filtering) as a custom background.
if (length(sig_genes) >= 10) {
gost_res <- gprofiler2::gost(
query = sig_genes,
custom_bg = bg_genes,
domain_scope = "custom",
organism = "hsapiens",
correction_method = "fdr")
enrich_tbl <- gost_res$result |>
dplyr::arrange(p_value) |>
dplyr::mutate(
term_name = stringr::str_trunc(term_name, 60)
)
head(enrich_tbl[, c("term_name", "source", "p_value", "intersection_size")])
} else {
gost_res <- NULL
enrich_tbl <- NULL
cat("Too few significant genes for a meaningful enrichment analysis.\n")
}## term_name source p_value
## 1 supramolecular polymer GO:CC 1.726927e-08
## 2 supramolecular fiber GO:CC 4.247341e-08
## 3 contractile muscle fiber GO:CC 2.065546e-07
## 4 sarcomere GO:CC 8.050059e-07
## 5 collagen-containing extracellular matrix GO:CC 1.058575e-06
## 6 external encapsulating structure GO:CC 1.197663e-06
## intersection_size
## 1 72
## 2 70
## 3 38
## 4 33
## 5 43
## 6 46
Exercise 5.2 Which annotation sources dominate the top enriched terms? Do you see processes or phenotypes that are relevant for muscle or dystrophy?
5.1.3 Visualisation with gostplot()
gProfiler provides a built-in visualisation that summarises enriched terms.
Exercise 5.3 What can you interpret from the global enrichment plot? What additional insight do the phenotype terms provide?
5.1.4 Publication-style barplot of top enriched terms
In addition to the gProfiler visualisation, it is useful to include a compact barplot showing the top enriched terms ranked by significance. This is the same type of plot used in Day 4.
if (!is.null(enrich_tbl)) {
top_terms <- enrich_tbl[1:min(30, nrow(enrich_tbl)), ]
ggplot(
top_terms,
aes(
x = reorder(term_name, -log10(p_value)),
y = -log10(p_value),
fill = source
)
) +
geom_col() +
coord_flip() +
theme_minimal() +
labs(
title = "Top enriched terms (g:Profiler)",
x = "Term",
y = "-log10 p value"
)
}Exercise 5.4 Compare this barplot with the gProfiler visualisation. Which representation do you find easier to interpret?
5.1.5 Export gene lists for online tools
We export both the significant gene set and the ranked list. These can be used for STRING, g:Profiler web, Enrichr or GSEA-like tools.
dir.create("results", showWarnings = FALSE, recursive = TRUE)
# Simple list of significant genes
out_sig <- "results/PXD052728_3D_DD_vs_HC_sig_genes.txt"
write.table(
sig_genes,
file = out_sig,
row.names = FALSE,
col.names = FALSE,
quote = FALSE
)
cat("Significant gene list saved to:", out_sig, "\n")## Significant gene list saved to: results/PXD052728_3D_DD_vs_HC_sig_genes.txt
# Ranked list using moderated t-statistics
out_ranked <- "results/PXD052728_3D_DD_vs_HC_ranked_genes.tsv"
ranked_out <- tibble::tibble(
gene = names(gene_list),
t_stat = as.numeric(gene_list)
)
readr::write_tsv(ranked_out, out_ranked)
cat("Ranked gene list saved to:", out_ranked, "\n")## Ranked gene list saved to: results/PXD052728_3D_DD_vs_HC_ranked_genes.tsv
Exercise 5.5 Upload your gene lists to online tools such as STRING, Enrichr or g:Profiler web. Which tool gives the clearest representation of the biological themes? Are the main biological themes consistent with what you see in the R-based g:Profiler analysis? Which visualisations help you most to explain the biology to someone else?
6. Extensions
Ideas for students to explore on their own:
- Repeat the whole workflow for 2D samples only.
- Compare H vs L within DD or MD in 3D.
- Change thresholds in the volcano plot (for example 1.2 instead of 1.5 fold change) and see how the number of significant proteins changes.
- Export the list of significant genes to external tools such as Enrichr or DAVID and compare the enrichment results.
7. Session Information
## R version 4.5.1 (2025-06-13)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 22.04.5 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so; LAPACK version 3.10.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Europe/Paris
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] gprofiler2_0.2.4 ggrepel_0.9.6 reshape2_1.4.5 limma_3.64.3
## [5] pheatmap_1.0.13 lubridate_1.9.4 forcats_1.0.1 stringr_1.6.0
## [9] dplyr_1.1.4 purrr_1.2.0 readr_2.1.6 tidyr_1.3.1
## [13] tibble_3.3.0 ggplot2_4.0.1 tidyverse_2.0.0
##
## loaded via a namespace (and not attached):
## [1] gtable_0.3.6 xfun_0.54 bslib_0.9.0 htmlwidgets_1.6.4
## [5] tzdb_0.5.0 crosstalk_1.2.2 bitops_1.0-9 vctrs_0.6.5
## [9] tools_4.5.1 generics_0.1.4 curl_7.0.0 parallel_4.5.1
## [13] pkgconfig_2.0.3 data.table_1.17.8 RColorBrewer_1.1-3 S7_0.2.1
## [17] lifecycle_1.0.4 compiler_4.5.1 farver_2.1.2 statmod_1.5.1
## [21] httpuv_1.6.16 htmltools_0.5.8.1 sass_0.4.10 RCurl_1.98-1.17
## [25] yaml_2.3.10 lazyeval_0.2.2 plotly_4.11.0 later_1.4.4
## [29] pillar_1.11.1 crayon_1.5.3 jquerylib_0.1.4 cachem_1.1.0
## [33] mime_0.13 tidyselect_1.2.1 digest_0.6.38 stringi_1.8.7
## [37] bookdown_0.45 labeling_0.4.3 rmdformats_1.0.4 fastmap_1.2.0
## [41] grid_4.5.1 cli_3.6.5 magrittr_2.0.4 utf8_1.2.6
## [45] withr_3.0.2 promises_1.5.0 scales_1.4.0 bit64_4.6.0-1
## [49] timechange_0.3.0 rmarkdown_2.30 httr_1.4.7 otel_0.2.0
## [53] bit_4.6.0 hms_1.1.4 shiny_1.11.1 evaluate_1.0.5
## [57] knitr_1.50 viridisLite_0.4.2 rlang_1.1.6 Rcpp_1.1.0
## [61] xtable_1.8-4 glue_1.8.0 rstudioapi_0.17.1 vroom_1.6.6
## [65] jsonlite_2.0.0 R6_2.6.1 plyr_1.8.9